Data Profiling

Context

The following project aims to make an exploratory analysis on the historical data of the months of January to July of the year 2018 provided by the page of ecobici; in the same link you can find what you need to access the API provided by this service. The documentation on the API can be found in the following link.

Ecobici is a public bicycle service in Mexico City aimed at the inhabitants of the capital, its surroundings and tourists.

The operation of this service allows registered users to take a bicycle from any station and return it in the closest to its destination in unlimited journeys of 45 minutes.

The way to access this service is through an annual, weekly, three-day or daily subscription.

This service is available from 5:00 a.m. to 12:00 a.m. every day of the year, which starts in 2010, February with 84 “cicloestaciones” (“ciclostations”) and 1,200 bicycles.

Currently Ecobici has more than 170 thousand registered users and the service is available in 55 colonies in Mexico City, in an area of 38 square kilometers.

The historical data was downloaded and stored in a database called ECOBICI; the scripts to export this DB can be found in the files provided in this project in the data folder.

The DB ECOBICI contains 7 tables, one for each month, that is, one for each file downloaded directly from the ecobici page; each table contains the following fields:

  • genero_usuario: Genre of the user who used the ecobici service.
  • age_user: Age of the user who used the ecobici service.
  • id_bici: Identifier of the bicycle used.
  • ciclo_estacion_retiro: Identifier of the station where the bicycle was removed.
  • re_date: Date of withdrawal of the bicycle from the last station registered for its use.
  • re_time: Bicycle withdrawal time of the last station registered for its use.
  • cycle_stacion_arribo: Identifier of the station where the bicycle is returned.
  • cycle_stacion_arribo: Date of arrival of the bicycle to the arrival station.
  • arr_time: Arrival time of the bicycle to the arrival station.

The following code is used to access a MySQL database, however what we used was directly from Excel files and Data Frames construction. Why? Only for practicity for the purpose of this document.

Importing libraries

If we would have connected to the DB, then we would execute this code:

## [1] "[db.host <- 'localhost'\ndb.user <- 'root'\ndb.port <- 3306\ndb.password <- '$<your password>'\n\n## DB Connection\ndb_connect <- function(db.name) {\n  db <- dbPool(\n    drv = RMySQL::MySQL(),\n    dbname = db.name,\n    host = db.host,\n    user = db.user,\n    password = db.password,\n    port = as.numeric(db.port)\n  )\n  \n  return(db)\n}]"

Then we would took converted each month of data into Data Frames.

## [1] "[\nJanuary <- tbl(db_connect('ECOBICI'), 'Enero') %>% collect()\nFebruary <- tbl(db_connect('ECOBICI'), Febrero') %>% collect()\nMarch <- tbl(db_connect('ECOBICI'), Marzo) %>% collect()\nApril <- tbl(db_connect('ECOBICI'), Abril) %>% collect()\nMay <- tbl(db_connect('ECOBICI'), 'Mayo') %>% collect()\nJune <- tbl(db_connect('ECOBICI'), 'Junio') %>% collect()\nJuly <- tbl(db_connect('ECOBICI'), 'Julio') %>% collect()\n]"

But instead we read the information directly from “.cvs” files, this way:

It’s important to see what type of values contain our data, to clean or addecuate values for what we need.

With the following function the data type will be modified in user_genre and the time and date data will be collapsed in a single data for the arrival times and the withdrawal times recorded.

Then through the previous function original Data Frames were modified.

… and got this new information:

Data Profiling

To make a profiling of data, we used the DataProfiling module which is found in the documents of this project.

It should be noted that edad_usuario,id_bici, ciclo_estacion_retiro,ciclo_estacion_arribo are identifiers, so, even if they are numerical type, they do not provide any relevant quantitative information, therefore we only observed characteristics over data which do not include numerical summaries.

First a data count and then a summary of each month was made.

Data count of each month

January

A table with format specifications is built, that large numbers separate them with “,” and do not use scientific notation.

uniques nan mode
genero_usuario 2 0 M
edad_usuario 67 0 28
bici 4,897 1 7376
ciclo_estacion_retiro 432 0 271
ciclo_estacion_arribo 435 0 1

For the time data it was only verified if there are null values and the count of the unique values.

uniques nan
re_datetime 545,217 0
arr_datetime 545,660 0

There weren’t null values.

January Findings

  • More than 50% of users are male.
  • The mode of users age is 28 years old.
  • The bicycle with id 7376 was the most used.
  • The busiest station to take a bicycle was the 271.
  • The busiest station to return a bicycle was the 1.

February

uniques nan mode
genero_usuario 2 0 M
edad_usuario 73 0 28
bici 5,058 44 2019
ciclo_estacion_retiro 476 0 271
ciclo_estacion_arribo 476 0 43
uniques nan
re_datetime 541,252 0
arr_datetime 541,095 0

February Findings

  • More than 50% of users are male.
  • The mode of users age is 28 years old.
  • The bicycle with id 2019 was the most used.
  • The busiest station to take a bicycle was the 271.
  • The busiest station to return a bicycle was the 43.

March

uniques nan mode
genero_usuario 2 0 M
edad_usuario 72 0 28
bici 4,931 122 2698
ciclo_estacion_retiro 476 0 271
ciclo_estacion_arribo 478 0 27
uniques nan
re_datetime 580,412 0
arr_datetime 579,596 0

March Findings

  • More than 50% of users are male.
  • The mode of users age is 28 years old.
  • The bicycle with id 2698 was the most used.
  • The busiest station to take a bicycle was the 271.
  • The busiest station to return a bicycle was the 27.

Abril

uniques nan mode
genero_usuario 2 0 M
edad_usuario 71 0 28
bici 4,889 132 11065
ciclo_estacion_retiro 478 0 27
ciclo_estacion_arribo 478 0 27
uniques nan
re_datetime 589,901 0
arr_datetime 589,669 0

April Findings

  • More than 50% of users are male.
  • The mode of users age is 28 years old.
  • The bicycle with id 11065 was the most used.
  • The busiest station to take a bicycle was the 27.
  • The busiest station to return a bicycle was the 27.

May

uniques nan mode
genero_usuario 2 0 M
edad_usuario 72 0 28
bici 4,982 0 15259
ciclo_estacion_retiro 480 0 271
ciclo_estacion_arribo 480 0 27
uniques nan
re_datetime 618,668 0
arr_datetime 618,486 0

May Findings

  • More than 50% of users are male.
  • The mode of users age is 28 years old.
  • The bicycle with id 15259 was the most used.
  • The busiest station to take a bicycle was the 271.
  • The busiest station to return a bicycle was the 27.

June

uniques nan mode
genero_usuario 2 0 M
edad_usuario 71 0 28
bici 4,980 0 2789
ciclo_estacion_retiro 479 0 271
ciclo_estacion_arribo 479 0 27
uniques nan
re_datetime 545,135 0
arr_datetime 545,241 0

June Findings

  • More than 50% of users are male.
  • The mode of users age is 28 years old.
  • The bicycle with id 2789 was the most used.
  • The busiest station to take a bicycle was the 271.
  • The busiest station to return a bicycle was the 27.

July

uniques nan mode
genero_usuario 2 0 M
edad_usuario 70 0 28
bici 4,929 0 9581
ciclo_estacion_retiro 480 0 27
ciclo_estacion_arribo 480 0 27
uniques nan
re_datetime 568,438 0
arr_datetime 569,168 0

July Findings

  • More than 50% of users are male.
  • The mode of users age is 28 years old.
  • The bicycle with id 9581 was the most used.
  • The busiest station to take a bicycle was the 27.
  • The busiest station to return a bicycle was the 27.

EDA

Thanks to the data shown before, it was possible to make an Exploratory Data Analysis.

See that station number 271 is the one with the highest number of records where users took a bicycle in the months of:

  • January, February, March, May and June.

On the other hand, station 27 is the most frequented to take a bicycle in the months of April and July. Also this station is the majority where users returned a bicycle in the months of:

  • March, April, May, June and July.

Now, several questions or issues arised about the data.

Most recurrent locations

Where are stations 27 and 271 located, as well as stations 1 and 43?

A map of all the stations can be found on the official ecobici website, although in this link the corresponding file is found.

The location of each of the 480 stations can be found in the following page.

Or it can be obtained from the API, where an access token is needed, obtained from the credentials of each user, this access token has an expiration time of one hour. (put in browser https://pubsbapi.smartbike.com/oauth/v2/token?client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}&grant_type=client_credentials).

Location of each station

Before each operation, it’s important to get an API_KEY from google console, therefore we had to create a project and then enable the Google Maps API.

27 REFORMA-HAVRE

## Warning in data.frame(..., check.names = FALSE): row names were found from
## a short variable and have been discarded

271 AV. CENTRAL-J. MENESES

## Warning in data.frame(..., check.names = FALSE): row names were found from
## a short variable and have been discarded

1 RIO SENA-RIO BALSAS

## Warning in data.frame(..., check.names = FALSE): row names were found from
## a short variable and have been discarded

43 JUAREZ y REVILLAGIGEDO

## Warning in data.frame(..., check.names = FALSE): row names were found from
## a short variable and have been discarded

Summary

  • The station with more concurrence was the one that was located on “Reforma”, nearby “Reforma 222”, which is a mall and has work offices.

  • Two stations are located near of Buenavista subway station, where it is a point of connection between many people who come from the “Estado de Mexico” and other parts of Mexico City to move to the areas where the highest density of jobs is located.

  • Finally the station 43 (which at the time of writing this is out of operation) is very concurrent to have closeness to “Reforma” avenue and the downtown.

Data integration of each month

To have all the data collected in the months of January to July 2018, every data set was combined to obtain the average time that users use the ecobici service of all stations with registered usability.

The time of usability of each user was calculated:

And then this data arranged by “duracion”:

There are records where users took 0 and up to 1 second to return a bike at different stations, this is possibly due to two things mainly:

  1. If your ecobici isn’t in good condition, you have up to 2 minutes to return it.
  2. There are times when you take the ecobici and you drop by mistake, so you must request one again.

There are records where users took days, weeks and even up to 1.53 years to return a bicycle.

To avoid taking data where users had a mistake of did not return the bicycle, or they took longer than the regulation, a filter was used to take only data where there was a duration of 30 seconds from one station to another or those that comply with the established ecobici regulations duration of 45 minutes as maximum.

## [1] "The average time of use of an ecobici within the regulatory time, in Mexico City is of: 13 minutes"

With this new cleaning, new records were obtained between the main 4 stations (271, 27, 1 and 43).

Graphs about age and ubsaility of the Ecobici service.

Continuing with some other data, the following graph shows the proportion on the use of the service ecobici by registered ages.

By convention, those data in which the age is greater than 85 years was be omitted.

What are the most and least common hours of service?

This will serve to know when a higher demand for service and greater availability of bicycles is required. The parameter re_datetime is used because it is the time at which users start using a bicycle, which should not vary much during the day witharr_datetime, because the time of use with the new tables is less than 45 minutes.

To have the top 10 hours where there is greater demand for ecobici, users were counted by hour:

## # A tibble: 10 x 2
##     hora conteo
##    <int>  <int>
##  1    18 483166
##  2     8 478318
##  3    19 391346
##  4     9 370395
##  5    17 353746
##  6    14 348941
##  7    15 347330
##  8    16 287176
##  9    13 286461
## 10     7 261877

Bingo!

It is curious that these hours coincide with the usual time to enter and leave to work in Mexico City, which makes sense, because many people go to work on these vehicles.

There’s one more analysis and it’s to know which days of the week have the highest demand for bicycles, due to the number of users who take any.

## # A tibble: 7 x 2
##   dia_semana conteo
##   <chr>       <int>
## 1 Tuesday    900937
## 2 Wednesday  895546
## 3 Thursday   865172
## 4 Monday     822882
## 5 Friday     815889
## 6 Saturday   374238
## 7 Sunday     323518

It’s very clear that the activity decreases on weekends to less than half of usual and the busiest day is Tuesday.

The difference between the number of men who use this service against the female gender is remarkable.

Search for posts by hashtag #ecobici on Twitter

##  [1] "Es increíble que con @ecobici:\n\u274cEl @GobCDMX pague 180 millones anuales a Clear Channel por el servic"                                                                                               
##  [2] "Te van a criticar por todo.\n\nTú sigue pasando tu tarjeta por el lector al finalizar tus viajes aunqu"                                                                                                   
##  [3] "Les presentamos la nueva FORD EXPLORER BOLARDO la SUV de mal gusto, pero buena reversa! @BJAlcaldia "                                                                                                     
##  [4] "RT @AngeliqueMera: Es mi hermano.  \n\nTrabaja en @ecobici y desde ayer no llegó a trabajar. \n\n#TeBusc"                                                                                                 
##  [5] "RT @Reporte_Indigo: Mientras más personas se convierten en usuarios de las Ecobicis, el 30% de las b"                                                                                                     
##  [6] "RT @AngeliqueMera: \U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\n\nMi hermano trabaja en @ecobici ayer salió de casa y no llegó a traba"
##  [7] "@AngyBeckham Hola Angie, lamentamos los inconvenientes, nuestro equipo revisará lo ocurrido con la A"                                                                                                     
##  [8] "@ecobici qué onda con su app, en el mapa indica bicis disponibles y en las 4 que ya recorrí no hay n"                                                                                                     
##  [9] "\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\n\nMi hermano salió ayer rumbo a su trabajo en @ecobici y no llegó, no sabemos nada de él.… ht"                                    
## [10] "RT @XochitlGalvez: Es increíble que con @ecobici:\n\u274cEl @GobCDMX pague 180 millones anuales a Clear Ch"                                                                                               
## [11] "RT @AngeliqueMera: \U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\n\nMi hermano trabaja en @ecobici ayer salió de casa y no llegó a traba"
## [12] "RT @AngeliqueMera: \U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\n\nMi hermano trabaja en @ecobici ayer salió de casa y no llegó a traba"
## [13] "@LuisLobato Hola, trabajaremos con las instancias correspondientes para atender el reporte"                                                                                                               
## [14] "@ecobici la estación 371 Tlacoquemecatl, bloqueada por dos autos desde hace más de 25 minutos. https"                                                                                                     
## [15] "RT @AngeliqueMera: \U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\U0001f6a8\n\nMi hermano trabaja en @ecobici ayer salió de casa y no llegó a traba"